I. Introduction

Annually,  the  U.S.  Army  Safety  Center  (USASC) ,  Fort  Rucker, 
Alabama,  analyzes  approximately  20,000  reports  of  accidents  to 
identify  cause  factors,  develop  countermeasures  and  evaluate 
effectiveness  of  fielded  countermeasures.  Efficient  analysis  of 
such  massive  data  requires  use  of  methods  of  data  reduction  to 
reveal  the  essential  targets.  One  of  the  methods  used  by  the  Safety 
Center  is  factor  analysis.  However,  the  nature  of  much  of  the 
accident  data  (binary)  is  not  amenable  to  the  Pearsonian  Product 
Moment  Correlation  Coefficient  that  drives  the  factor  analysis.  A 
substitute  coefficient  (Jaccard  Similarity  Coefficient)  has  been 
employed  by  USASC  but  the  theoretical  foundation  for  using  this 
procedure  has  not  been  established. 

In Section 1 of this report we examine the feasibility of using a
similarity (matching or associative) coefficient as a substitute
for the correlation/covariance matrix in the factor analysis
procedure. We also examine the definition of dichotomous scoring
and compare it with the binary data made available to the USASC.

From Section 1, it was determined that the method of data
reduction had many mathematical pitfalls. A more efficient method
should be utilized to reduce the accident data. In Section 2 of
this report, we develop a methodology using accident data supplied
by the USASC (Night Study data). This data is considered highly
parsimonious in both the physical interpretation and mathematical
complexity. The procedure which was investigated is the VARCLUS
procedure contained in the SAS Institute Inc. statistical package
(SAS). It is felt that the parsimony mentioned above is minimized
by using this procedure.

The VARCLUS procedure was investigated because of its usefulness
in interpreting large numbers of variables. VARCLUS is a
variable-reduction method, and it is also useful in determining
whether there is a relationship between variables.

It should be noted that the analysis performed in this report
was done using a clustering technique (the centroid method) to
duplicate the factor analysis procedure established by the USASC.
In effect, while performing this analysis, we found ourselves
working towards answers already achieved. This approach allowed a
thorough evaluation of the mathematical procedures used by the
USASC.
Section 1: Mathematical Analysis


A.  Definitions  and  Terminology 

While investigating the literature on Factor Analysis and
Cluster Analysis, it became apparent that one term may have many
different names. The name given to a term depends primarily on the
author's background.

One such example is the correlation coefficient. In
mathematics, the term is usually referred to as R. Phi ($\phi$) is
mathematically equivalent to R when both variables are dichotomous.
There are many other names for the correlation coefficient; some of
the more common ones, with their associated mathematical equations
and references, are listed in this section.

The  following  is  a  list  of  correlation  coefficients  which  can 
be  algebraically  reduced  to  equivalent  equations.  The  only 
differences  between  these  equations  are  the  names  and  the  form  of 
the  equations.  This  is  important  when  reading  any  literature  about 
cluster  analysis  and  factor  analysis,  because  these  terms  are  used 
interchangeably  depending  upon  the  background  of  the  author. 

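Product Moment Correlation Coefficient          Pg 85 Anderberg
Cophenetic Correlation Coefficient              Pg 26 Romesburg

\[
r = \frac{\sum_{i=1}^{n} x_i y_i - \frac{1}{n}\sum_{i=1}^{n} x_i \sum_{i=1}^{n} y_i}
         {\left\{\left[\sum_{i=1}^{n} x_i^2 - \frac{1}{n}\left(\sum_{i=1}^{n} x_i\right)^2\right]
                 \left[\sum_{i=1}^{n} y_i^2 - \frac{1}{n}\left(\sum_{i=1}^{n} y_i\right)^2\right]\right\}^{1/2}}
\]

Coefficient of Correlation (R)                  Pg 441 Freund and Smith

\[
R = \frac{n\sum_{i=1}^{n} x_i y_i - \left(\sum_{i=1}^{n} x_i\right)\left(\sum_{i=1}^{n} y_i\right)}
         {\sqrt{n\sum_{i=1}^{n} x_i^2 - \left(\sum_{i=1}^{n} x_i\right)^2}\;
          \sqrt{n\sum_{i=1}^{n} y_i^2 - \left(\sum_{i=1}^{n} y_i\right)^2}}
\]

Pearsonian Correlation Coefficient              Pg 42 Basilevsky

\[
r_{xy} = \frac{\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}
              {\left[\sum_{i=1}^{n} (x_i - \bar{x})^2 \sum_{i=1}^{n} (y_i - \bar{y})^2\right]^{1/2}}
\]

Correlation Coefficient in Standard Score Form  Pg 27 Comrey

\[
r_{ij} = \frac{\sum_{k=1}^{N} z_{ik} z_{jk}}{N}
       = \frac{\frac{1}{N}\sum_{i=1}^{N} (x_i - \bar{x})(y_i - \bar{y})}
              {\left[\frac{\sum_{i=1}^{N} (x_i - \bar{x})^2}{N}\right]^{1/2}
               \left[\frac{\sum_{i=1}^{N} (y_i - \bar{y})^2}{N}\right]^{1/2}}
       = \frac{\sum_{i=1}^{N} (x_i - \bar{x})(y_i - \bar{y})}
              {\left[\sum_{i=1}^{N} (x_i - \bar{x})^2 \sum_{i=1}^{N} (y_i - \bar{y})^2\right]^{1/2}}
\]

Phi Coefficient                                 Pg 149 Romesburg

\[
c_{jk} = \frac{ad - bc}{\left[(a+b)(a+c)(b+d)(c+d)\right]^{1/2}}
\]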


Another source of confusion is the naming of the similarity
coefficients. Some of the coefficients have as many as three names
for the same term. The Sorensen Coefficient is also known as the
Dice Coefficient and Czekanowski's Coefficient. There are many
similarity coefficients; some of the more common ones, with their
associated mathematical equations and references, are listed.

Also listed are the qualitative resemblance coefficients that
include the dissimilarity matches, the 0-0 or d cell. It was found
that these coefficients reduce to the Jaccard Coefficient or the
Dice Coefficient when the dissimilarities are removed.

The coefficients listed below are all clustering coefficients.
We list them here because they will be used throughout this text.

Jaccard Coefficient

\[
J_{ij} = \frac{a}{a+b+c}
\]

Dice Coefficient                                Pg 89 Anderberg
Sorensen Coefficient                            Pg 145 Romesburg
Czekanowski Coefficient                         Pg 356 Seber

\[
c_{ij} = \frac{2a}{2a+b+c}
\]

Gower Coefficient                               Pg 31 Aldenderfer &
                                                Blashfield

\[
S_{ij} = \frac{\sum_{k} s_{ijk}}{\sum_{k} w_{ijk}}
\]

where $w_{ijk}$ has a value of one (1) if a comparison of variable k
is considered and zero (0) if no comparison exists, and $s_{ijk}$ has
a value of one (1) if i and j are both similar. When used with
dichotomous data, the Gower Coefficient becomes the Jaccard
Coefficient.

Many coefficients take the form of the Jaccard or Dice
Coefficient when dissimilarities are not taken into consideration.
Simple Matching Coefficient

\[
\frac{a+d}{a+b+c+d} \;\xrightarrow{\;d=0\;}\; \frac{a}{a+b+c}
\]

Russell & Rao:

\[
\frac{a}{a+b+c+d} \;\xrightarrow{\;d=0\;}\; \frac{a}{a+b+c}
\]

Baroni-Urbani & Buser                           Pg 150 Romesburg

\[
\frac{a+(ad)^{1/2}}{a+b+c+(ad)^{1/2}} \;\xrightarrow{\;d=0\;}\; \frac{a}{a+b+c}
\]

Sokal & Sneath:

\[
\frac{2(a+d)}{2(a+d)+b+c} \;\xrightarrow{\;d=0\;}\; \frac{2a}{2a+b+c}
\]
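
The reductions listed above can be checked numerically. The following is a minimal Python sketch, not part of the original analysis; the function names and the cell counts are hypothetical and chosen only for illustration.

```python
from math import sqrt

def jaccard(a, b, c):
    """1-1 matches over all pairs in which at least one variable is present."""
    return a / (a + b + c)

def dice(a, b, c):
    """Dice (Sorensen, Czekanowski): 1-1 matches are counted twice."""
    return 2 * a / (2 * a + b + c)

def simple_matching(a, b, c, d):
    return (a + d) / (a + b + c + d)

def russell_rao(a, b, c, d):
    return a / (a + b + c + d)

def baroni_urbani_buser(a, b, c, d):
    root = sqrt(a * d)
    return (a + root) / (a + b + c + root)

def sokal_sneath(a, b, c, d):
    return 2 * (a + d) / (2 * (a + d) + b + c)

# Hypothetical cell counts; d is set to zero to drop the 0-0 matches.
a, b, c, d = 6, 3, 1, 0

for name, func in [("simple matching", simple_matching),
                   ("Russell & Rao", russell_rao),
                   ("Baroni-Urbani & Buser", baroni_urbani_buser)]:
    print(name, func(a, b, c, d), "Jaccard:", jaccard(a, b, c))

print("Sokal & Sneath", sokal_sneath(a, b, c, d), "Dice:", dice(a, b, c))
```

With d forced to zero, the first three coefficients print the same value as the Jaccard Coefficient and the Sokal & Sneath coefficient prints the same value as the Dice Coefficient.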
B. Definition of Scoring

The Safety Center data which we are observing contains aircraft
accidents. Each accident is described in terms of fifty-six (56)
variables. Binary data is used, in which a one (1) indicates the
presence of that particular variable and a zero (0) indicates the
absence of that particular variable.

These  56  variables  can  be  broken  down  into  ten  (10) 
categories. 

A.  Fatigue 

B.  Illumination 

C.  Aircraft 

D.  Mission 

E. Aided

F.  Problem  Area 

G.  Pilot  in  Control 

H.  Task  Error 

I.  Phase  of  Flight 

J.  Experience 

The variables within each category are mutually exclusive. For
example, each individual accident can involve only one type of
aircraft.

During  the  investigation  of  the  Safety  Center  data,  it  was 
determined  we  are  not  dealing  with  truly  dichotomous  data. 
Dichotomous  data  is  defined  as  data  which  is  transformed  to  a 
binary  set,  1  equalling  the  occurrence  of  an  event  and  0  being  the 
complement  of  that  event.  Two  sets  of  data  are  transformed  and 
examined.  The  resultant  data  set  is  made  up  of  four  (4) 
independent  outcomes. 


                          VARIABLE/EVENT 1
                            1         0
   VARIABLE/       1        a         b
   EVENT 2         0        c         d

1 - indicates the presence of that variable/event
0 - indicates the absence of that variable/event

a - the # of times variable/event 1 occurred and
    variable/event 2 occurred.
b - the # of times variable/event 2 occurred and
    variable/event 1 did not occur.
c - the # of times variable/event 1 occurred and
    variable/event 2 did not occur.
d - the # of times neither variable/event occurred.

The  following  example  illustrates  true  dichotomous  data. 
Variable  1  is  defined  as  a  turning  error  on  a  UH1  aircraft,  and 
variable  2  is  defined  as  a  UH1  failure.  The  definitions  for  the 
a,b,c,  and  d  terms  are  as  follows: 

a  -  the  #  of  times  a  UH1  failed  and  there  was  a  turning 
error. 

b  -  the  #  of  times  a  UH1  failed  and  there  was  not  a 
turning  error. 

c  -  the  #  of  times  a  UH1  did  not  fail  but  there  was  a 
turning  error. 

d  -  the  #  of  times  a  UH1  did  not  fail  and  there  was  not 
a  turning  error. 

The Safety Center utilizes binary scoring for their data. The
difference between the Safety Center data and purely dichotomous
data is the definition of the zero (0) terms for each variable.

The following is an example of the Safety Center data:

VARIABLE 1
1 - turning error was made
0 - some other error was made

VARIABLE 2
1 - failure of a UH1
0 - some other aircraft failed

If  the  Safety  Center  was  dealing  with  only  two  aircraft  and 
two  types  of  errors  this  scoring  would  be  acceptable.  However,  the 
Safety  Center  is  dealing  with  seven  aircraft  and  a  number  of  errors 
or  other  variables. 

Since the Safety Center data is defined in such a way, any
evaluation involving the 0 terms will be inaccurate. In the case of
the c values, we are not comparing turning errors made on UH1s that
did not have accidents. Instead, we are moving the comparison away
from a purely dichotomous relationship by broadening it to turning
errors which contributed to accidents on all other aircraft. These
values have no relationship to UH1s, yet they are used in the
calculation of a coefficient which is meant to give a numerical
value to the relationship between turning errors and UH1 aircraft.

Even when purely dichotomous data was used, this type of scoring
created another difficulty. An example is the use of aided and
unaided as two separate variables. When they were transformed into
dichotomous data the definitions were as follows:

AID
1 - pilot was aided
0 - pilot was unaided

UNA
1 - pilot was unaided
0 - pilot was not unaided

The problem arises when these variables are compared. AID = 0 is
logically equivalent to UNA = 1, and UNA = 0 is equivalent to
AID = 1. This would cause double weighting of these variables when
they are compared with other variables.
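
The redundancy between AID and UNA can be made concrete with a small Python sketch. The case data below are hypothetical, and `cell_counts` is an illustrative helper, not a routine used by the Safety Center. Because UNA is the exact complement of AID, the a and d cells are empty, the Phi Coefficient is -1 and the Jaccard Coefficient is 0, so the second variable adds no information beyond the first.

```python
from math import sqrt

def cell_counts(x, y):
    """Count the four cells of the 2 x 2 table for two binary variables."""
    pairs = list(zip(x, y))
    a = pairs.count((1, 1))  # both present
    b = pairs.count((1, 0))  # only the first variable present
    c = pairs.count((0, 1))  # only the second variable present
    d = pairs.count((0, 0))  # neither present
    return a, b, c, d

# Hypothetical cases: 1 = pilot was aided, 0 = pilot was unaided.
aid = [1, 0, 1, 1, 0, 0, 1, 0]
una = [1 - value for value in aid]  # UNA is the complement of AID

a, b, c, d = cell_counts(aid, una)
phi = (a * d - b * c) / sqrt((a + b) * (a + c) * (b + d) * (c + d))
jaccard = a / (a + b + c)
print(a, b, c, d, phi, jaccard)
```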
C. Compare Factor Analysis with Modified Factor Analysis

After  the  data  has  been  collected,  the  variables  can  then  be 
investigated  to  determine  which  ones  will  be  used  in  the  analysis 
of  the  data.  This  procedure  can  be  done  by  clustering  and  then 
some  sort  of  analysis,  or  simply  using  factor  analysis  on  the  raw 
data. 


If clustering is performed, the first step would be to group
those variables which are closely associated with one another. There
are many coefficients which measure these associations. One such
coefficient commonly used by the Safety Center is the Jaccard
Similarity Coefficient. This coefficient identifies similarities
and disregards all dissimilarities. Once the clustering has been
performed, analysis can be done by simply looking at the clusters
or by some further statistical methods such as factor analysis.

Whatever  the  goals  of  an  analysis,  in  most  cases  it  will 
involve  the  following  major  steps:  (a)  selecting  variables;  (b) 
computing  the  matrix  of  correlation/covariance  among  the  variables; 
(c)  extracting  the  unrotated  factors;  (d)  rotating  factors;  (e) 
interpreting  the  rotated  factor  matrix. 

One  common  objective  of  the  factor  analysis  is  to  provide  a 
relatively  small  number  of  factor  constructs  that  will  serve  as 
satisfactory  substitutes  for  a  much  larger  number  of  variables. 
The  factor  constructs  themselves  are  variables  that  may  prove  to  be 
more  useful  than  the  original  variables  from  which  they  were 
derived. The first step is to decide whether to use a
covariance matrix or a correlation matrix, depending upon the
results the researcher is trying to obtain. Once this matrix is
obtained, the number of factors should be found. This can be
achieved by finding the rank of the correlation matrix. Another way
to find the number of factors is to compute the eigenvalues of the
matrix and retain the smallest number of leading eigenvalues whose
cumulative sum reaches a predetermined percentage of the sum of all
the eigenvalues.
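
The cumulative-percentage rule described above can be sketched in a few lines of Python. The eigenvalues and the 80% threshold below are hypothetical, and `factors_needed` is an illustrative name, not a SAS option or a Safety Center routine.

```python
def factors_needed(eigenvalues, threshold=0.80):
    """Smallest number of leading eigenvalues whose cumulative share
    of the total reaches the given threshold (a fraction of 1)."""
    ordered = sorted(eigenvalues, reverse=True)
    total = sum(ordered)
    running = 0.0
    for count, value in enumerate(ordered, start=1):
        running += value
        if running / total >= threshold:
            return count
    return len(ordered)

# Hypothetical eigenvalues of a 5 x 5 correlation matrix (they sum to 5).
eigenvalues = [2.9, 1.4, 0.4, 0.2, 0.1]
print(factors_needed(eigenvalues, threshold=0.80))
```

Here the first eigenvalue explains 58% of the total and the first two explain 86%, so two factors are retained at an 80% threshold.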

Once the number of factors is specified, a linear combination
of the variables can be determined for each factor. The first
linear combination accounts for the largest amount of variance and
each combination after that accounts for a lesser amount of the
variance. From these linear combinations, it can be determined
which variables are responsible for the variance.

The Safety Center initially uses the Jaccard Similarity
Coefficient to find variables within a specific group (e.g.,
illumination) which occur simultaneously. A new variable is
defined which consists of the two variables which exhibited a large
Jaccard value.

The Safety Center then uses factor analysis to determine which
variables are important in aircraft accidents. One of the
differences between the Safety Center factor analysis and what was
described before is the use of the Jaccard coefficient in place of
the correlation/covariance coefficient.

Jaccard  Distribution  Table 

The Jaccard Similarity Coefficient is the ratio of the number of
1-1 matches (both variables occurring simultaneously) to the
number of cases in which at least one of the variables is present
(a+b+c).


                          VARIABLE/EVENT 1
                            1         0
   VARIABLE/       1        a         b
   EVENT 2         0        c         d

1 - indicates the presence of that variable/event
0 - indicates the absence of that variable/event


The Jaccard Similarity Coefficient is defined as:

\[
J_{ij} = \frac{a}{a+b+c}, \qquad 0 \le J_{ij} \le 1
\]


Hence, when creating the Jaccard Similarity Coefficient Table it
was found that the coefficients were uniformly distributed on the
integers 1, 2, ..., n, n being the total sample size (a+b+c):

\[
f(x) =
\begin{cases}
\dfrac{1}{n} & \text{for } x = 1, 2, \ldots, n, \\
0 & \text{otherwise.}
\end{cases}
\]

Therefore, the total sample size (a+b+c) can be multiplied by the
needed confidence level to determine the minimum number of 1-1
matches needed to give the tolerable sample size.
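
As a rough illustration of the rule just stated, the following Python sketch multiplies the sample size by the confidence level. Rounding the product up to the next whole match is an assumption made here, since the text does not say how fractional products are handled, and `minimum_matches` is a hypothetical name.

```python
from fractions import Fraction
from math import ceil

def minimum_matches(n, level):
    """Minimum number of 1-1 matches for sample size n = a+b+c.

    The level is converted to an exact fraction so that products such
    as 20 * 0.975 are not disturbed by floating-point error. Rounding
    up is an assumption, not a rule taken from the report.
    """
    return ceil(n * Fraction(str(level)))

for level in (0.90, 0.95, 0.975, 0.99):
    print(level, minimum_matches(10, level))
```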
The following chart is an example of the Confidence Levels of the
Jaccard Coefficients.

Chart 1.

n Sample Size (a+b+c)   1%   2.5%   5%   10%   90%   95%   97.5%   99%


The matrix shown below is an example of the Jaccard Similarity
Coefficient Matrix, with all possible combinations calculated for
n = 1, 2, ..., 10.

                            Number of "a" terms

 n (a+b+c)     10     9     8     7     6     5     4     3     2     1

    10          1  0.90  0.80  0.70  0.60  0.50  0.40  0.30  0.20  0.10
     9                1  0.89  0.78  0.67  0.56  0.44  0.33  0.22  0.11
     8                      1  0.88  0.75  0.63  0.50  0.38  0.25  0.13
     7                            1  0.86  0.71  0.57  0.43  0.29  0.14
     6                                  1  0.83  0.67  0.50  0.33  0.17
     5                                        1  0.80  0.60  0.40  0.20
     4                                              1  0.75  0.50  0.25
     3                                                    1  0.67  0.33
     2                                                          1  0.50
     1                                                                1
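
A table of this kind can be regenerated with a short Python sketch. This is an illustration only: `jaccard_table` is a hypothetical name, and the half-up rounding to two decimals is an assumption chosen so that values such as 5/8 print as 0.63.

```python
from decimal import Decimal, ROUND_HALF_UP

def jaccard_table(max_n=10):
    """Return {n: {a: J}} with J = a/n, rounded half up to two decimals,
    for every sample size n = a+b+c and every count a of 1-1 matches."""
    cents = Decimal("0.01")
    return {
        n: {
            a: (Decimal(a) / Decimal(n)).quantize(cents, rounding=ROUND_HALF_UP)
            for a in range(n, 0, -1)
        }
        for n in range(max_n, 0, -1)
    }

table = jaccard_table()
for n, row in table.items():
    print(f"{n:>2}", " ".join(str(value) for value in row.values()))
```

Each printed row starts at a = n, where the coefficient is 1, and runs down to a = 1.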
D. Jaccard Coefficient Insertion

The  operation  being  scrutinized  was  accomplished  by  the 
insertion  of  the  Jaccard  Coefficient  Matrix  into  the  Factor 
Analysis  problem  at  the  point  where  the  correlation  matrix  or 
covariance  matrix  between  the  variables  is  computed.  The  Jaccard 
Coefficient  is  a  similarity  coefficient.  A  similarity  coefficient 
measures  the  resemblance  between  two  (2)  objects  based  on  either  or 
both  of  two  (2)  logically  distinct  kinds  of  information  pertaining 
to  a  set  of  variables. 

The similarity coefficient provides information on the
presence or absence of the variables being compared. The use of
this coefficient is appropriate when comparing attributes of an
object. Coefficients of this type are dichotomous. The term
dichotomous is reserved for characters that are either present or
absent and whose absence in both of a pair of objects is not taken
as a match. This approach would be appropriate if the variables
were all nominal with two (2) states, the states simply being
alternatives with equal weight.

If P is the population of variables, then we can define a
similarity as a function that maps P x P into $R^1$ and satisfies the
following axioms:

(1)  $0 \le C_j(h,i) \le 1$ for all h and i in P, where P is the
     population of objects.

(2a) $C_j(h,h) = 1$.

(2b) $C_j(h,i) = 1$ only if i = h.

(3)  $C_j(h,i) = C_j(i,h)$.

The Jaccard Coefficient, $C_j(h,i)$, satisfies the above axioms.

With quantitative variables, one measure of similarity between $x_h$
and $x_i$, the observations on objects h and i, is the correlation of
the pairs $(x_{hk}, x_{ik})$, k = 1, 2, ..., n, namely, the Pearsonian
Correlation Coefficient,

\[
r_{hi} = \frac{\sum_{k=1}^{n} (x_{hk} - \bar{x}_{h\cdot})(x_{ik} - \bar{x}_{i\cdot})}
              {\left[\sum_{k=1}^{n} (x_{hk} - \bar{x}_{h\cdot})^2
                     \sum_{k=1}^{n} (x_{ik} - \bar{x}_{i\cdot})^2\right]^{1/2}}
\]

and $-1 \le r_{hi} \le 1$. As can be seen, the Pearsonian Correlation
Coefficient does not satisfy Axiom 1 above. Other problems are
observed which keep the Pearsonian Correlation Coefficient and the
Jaccard Similarity Coefficient from being mathematically consistent.
When using dichotomous data, the correlation coefficient can
be reduced to the a, b, c and d terms:

                          VARIABLE/EVENT 1
                            1         0
   VARIABLE/       1        a         b        a+b
   EVENT 2         0        c         d        c+d
                           a+c       b+d     a+b+c+d

Variable 1 = $x_i$
Variable 2 = $y_i$

\[
\sum_{i=1}^{n} x_i y_i = a
\]

\[
a + c = \sum_{i=1}^{n} x_i^2 = \sum_{i=1}^{n} x_i
\]

\[
a + b = \sum_{i=1}^{n} y_i^2 = \sum_{i=1}^{n} y_i
\]

\[
n = a + b + c + d
\]
Substituting the identities given above into the Product Moment
Correlation Coefficient:

\[
r = \frac{a - \frac{(a+b)(a+c)}{n}}
         {\left\{\left[(a+b) - \frac{(a+b)^2}{n}\right]
                 \left[(a+c) - \frac{(a+c)^2}{n}\right]\right\}^{1/2}}
\]

\[
r = \frac{an - (a+b)(a+c)}
         {\left\{(a+b)\,[n - (a+b)]\,(a+c)\,[n - (a+c)]\right\}^{1/2}}
\]

\[
r = \frac{ad - bc}{\left\{(a+b)(c+d)(a+c)(b+d)\right\}^{1/2}}
\]

\[
-1 \le r \le 1
\]

Hence, the Product Moment Correlation Coefficient equals the Phi
Coefficient.
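
The equivalence can also be verified numerically. The Python sketch below computes the Product Moment Correlation Coefficient from raw sums and the Phi Coefficient from the cell counts for the same pair of binary variables. The data vectors are hypothetical and the function names are illustrative.

```python
from math import sqrt

def pearson_r(x, y):
    """Product moment correlation computed from raw sums."""
    n = len(x)
    sx, sy = sum(x), sum(y)
    sxy = sum(xi * yi for xi, yi in zip(x, y))
    sxx = sum(xi * xi for xi in x)
    syy = sum(yi * yi for yi in y)
    return (sxy - sx * sy / n) / sqrt((sxx - sx * sx / n) * (syy - sy * sy / n))

def phi(x, y):
    """Phi coefficient computed from the 2 x 2 cell counts."""
    pairs = list(zip(x, y))
    a = pairs.count((1, 1))
    b = pairs.count((1, 0))
    c = pairs.count((0, 1))
    d = pairs.count((0, 0))
    return (a * d - b * c) / sqrt((a + b) * (c + d) * (a + c) * (b + d))

# Hypothetical binary observations on ten cases.
x = [1, 1, 1, 0, 0, 1, 0, 0, 1, 0]
y = [1, 1, 0, 0, 1, 1, 0, 0, 0, 0]
print(pearson_r(x, y), phi(x, y))
```

For these data the cell counts are a = 3, b = 2, c = 1, d = 4, and both functions give the same value up to floating-point rounding.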

The Jaccard Similarity Coefficient is being used, hence all
dissimilarities (0-0 matches) are removed from the data. The
dichotomous data table is as follows:

                          VARIABLE/EVENT 1
                            1         0
   VARIABLE/       1        a         b        a+b
   EVENT 2         0        c         0         c
                           a+c        b       a+b+c

Variable 1 = $x_i$
Variable 2 = $y_i$

The Jaccard Similarity Coefficient is defined as:

\[
J_{ij} = \frac{a}{a+b+c}
\]

and the Correlation Coefficient for the Jaccard Coefficient,
obtained by setting d = 0 in the Phi Coefficient, is given by:

\[
r = \frac{-bc}{\left\{(a+b)\,c\,(a+c)\,b\right\}^{1/2}}
\]
The range of $J_{ij}$ is from 0 to 1; it is the proportion of pure
similarities (1-1) in the total data set of the dichotomous data
being compared, less the dissimilarities (0-0). The Jaccard
Coefficient is completely void of the magnitude of the raw data. The
raw data has a possible range of $\pm\infty$. To further complicate
the data, the Correlation Coefficient for the Jaccard Coefficient
ranges from -1 to 0, and is strongly dependent on the 1-0 and 0-1
mismatches (b and c). Note that the matching coefficient is
discontinuous if b or c is zero (0). This is in contrast to the
Pearsonian Correlation Coefficient, which has a range of
$-1 \le r_{xy} \le +1$ and is dependent on a calculated Euclidean
distance.
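
The behavior described above can be illustrated with a small Python sketch. Setting d = 0 in the Phi Coefficient gives $-bc / \{(a+b)\,c\,(a+c)\,b\}^{1/2}$, which simplifies to $-\sqrt{bc} / \sqrt{(a+b)(a+c)}$ when b and c are nonzero. The function name and the cell counts are hypothetical.

```python
from math import sqrt

def correlation_without_00(a, b, c):
    """Phi coefficient with the 0-0 cell forced to zero (d = 0).

    The expression is 0/0 when b or c is zero, so that case is
    rejected rather than assigned a value.
    """
    if b == 0 or c == 0:
        raise ValueError("undefined when b or c is zero")
    return -sqrt(b * c) / sqrt((a + b) * (a + c))

# Hypothetical cell counts (a, b, c).
for a, b, c in [(6, 3, 1), (1, 5, 5), (0, 4, 4)]:
    print(a, b, c, a / (a + b + c), correlation_without_00(a, b, c))
```

Every printed correlation lies between -1 and 0, while the Jaccard value for the same cells lies between 0 and 1.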
E. Results of Analysis

Factor Analysis is one procedure that is very useful in those
situations in which one wants to reduce the number of variables
under consideration while at the same time retaining as much
subject-to-subject variability as possible. Most of the literature
concerning Factor Analysis suggests that binary data not be used,
and strongly recommends against data of this type when
manipulating large data sets. This is due to the fact that the
correlation matrix is based on the Euclidean distance of the
vectors in the plane and/or the covariance matrix, which is based
on the variance of the data being examined. Hence, the use of
binary (dichotomous) data is not recommended.

Cluster Analysis, however, is the grouping of similar
variables using data from the cases. It is part of the general
scientific process of searching for patterns in data and then
trying to construct laws that explain the patterns. Clustering can
compare case-to-case situations and would be appropriate when
working with dichotomous data. The data will form an
oblique-transformed data set within the cluster.

The essence of the clustering method consists of representing
the factors by reference axes passing through the centroids of the
respective groups of variables (clusters). Since the clusters of
variables would not ordinarily be at right angles to one another,
it can be assumed that the common factors within an individual
cluster are not orthogonal. We must assume some criterion for
establishing a Cluster Structure Matrix as a starting point,
with communalities in the principal diagonal. The actual analysis
begins with an appropriate clustering of n variables into m
clusters on some a priori or purely arbitrary basis. For the
"Night Study Data", we chose to establish the commonality on the
eight (8) Problem Areas (PA1 through PA8).

One major problem that occurs when clustering techniques are
being used is scaling. The effect of scaling can depend very much
on the skewness of the data. Dichotomous data can reflect a highly
skewed binary variable taking values 1 and 0 with relative
frequencies p and 1-p in the n objects. The sample variance p(1-p)
will be less than one (1), and division by $[p(1-p)]^{1/2}$ will
inflate the importance of the variable.
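
The size of this inflation is easy to see numerically. The Python sketch below, with hypothetical values of p and an illustrative function name, prints the factor $1/[p(1-p)]^{1/2}$ by which a binary variable is stretched when it is standardized.

```python
from math import sqrt

def scale_factor(p):
    """Multiplier applied to a 0/1 variable when it is divided by its
    standard deviation sqrt(p * (1 - p))."""
    return 1 / sqrt(p * (1 - p))

# Hypothetical relative frequencies of the value 1.
for p in (0.5, 0.1, 0.01):
    print(p, round(scale_factor(p), 2))
```

The factor is smallest (2) for a balanced variable and grows as the variable becomes more skewed, so rare events receive the largest weight after standardization.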

To overcome not only the scaling problem, but also correlation
effects among the variables, the Mahalanobis distance

\[
\Delta(x_r, x_s) = \left[(x_r - x_s)'\, S^{-1} (x_r - x_s)\right]^{1/2}
\]

where

\[
S = \sum_{i=1}^{n} \frac{(x_i - \bar{x})(x_i - \bar{x})'}{n-1}
\]

was implemented as a distance measure. This measure is invariant to
affine transformation.
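
The invariance property can be illustrated for two variables with plain Python. This is a minimal sketch under stated assumptions: the points are hypothetical, the helper names are illustrative, and the 2 x 2 inverse of S is written out by hand rather than taken from a library.

```python
from math import sqrt

def covariance_2d(points):
    """Sample covariance entries (sxx, sxy, syy) with divisor n - 1."""
    n = len(points)
    mx = sum(p[0] for p in points) / n
    my = sum(p[1] for p in points) / n
    sxx = sum((p[0] - mx) ** 2 for p in points) / (n - 1)
    syy = sum((p[1] - my) ** 2 for p in points) / (n - 1)
    sxy = sum((p[0] - mx) * (p[1] - my) for p in points) / (n - 1)
    return sxx, sxy, syy

def mahalanobis_2d(u, v, cov):
    """[(u - v)' S^-1 (u - v)]^(1/2) using the closed-form 2 x 2 inverse."""
    sxx, sxy, syy = cov
    det = sxx * syy - sxy * sxy
    dx, dy = u[0] - v[0], u[1] - v[1]
    quad = (syy * dx * dx - 2 * sxy * dx * dy + sxx * dy * dy) / det
    return sqrt(quad)

# Hypothetical observations on two variables.
points = [(1, 2), (2, 1), (3, 5), (4, 3), (6, 4)]
rescaled = [(10 * x, y) for x, y in points]  # change the units of variable 1

before = mahalanobis_2d(points[0], points[4], covariance_2d(points))
after = mahalanobis_2d(rescaled[0], rescaled[4], covariance_2d(rescaled))
print(before, after)
```

Rescaling the first variable by a factor of ten leaves the distance between the first and last points unchanged, up to floating-point rounding, whereas a Euclidean distance would change.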

Suppose that the data consists of two similar, well-separated
spherical clusters, each with $n_k$ points and centroids $\bar{x}_k$
(k = 1, 2) as indicated in Figure 1. The grand mean

\[
\bar{x} = \frac{n_1 \bar{x}_1 + n_2 \bar{x}_2}{n_1 + n_2}
\]

will lie on the line L joining the two centroids. The first
principal component is a'x, where a is the direction of the line
through $\bar{x}$ that minimizes the sum of the squares of the
distances of all the points from the line. Clearly this line will be
close to L, and the second component will correspond approximately
to the line M through $\bar{x}$ perpendicular to L. Since the
distances from M are much greater than from L, the first principal
component (obtained by projecting orthogonally onto L) will have a
large sample variance compared with the second component. These two
variances are the eigenvalues of S.
Figure 1. Two similar, well-separated spherical clusters.

Using the Mahalanobis distance instead of the Euclidean
distance, we are effectively replacing the $x_i$ by $S^{-1/2} x_i$,
which have a sample covariance matrix $S^{-1/2} S S^{-1/2} = I_d$.
This means that the two principal components are standardized to
have equal sample variances. With this new scaling, the
within-cluster distances increase relative to the between-cluster
distances and the clusters become less distinct. In our analysis,
the point of interest is the within-cluster distances.

The VARCLUS Procedure was selected from SAS to perform the
Night Study Analysis. The procedure meets all the mathematical
criteria and overcomes the restrictions stated in this section. In
the next section the procedures are compared, and the output
products of the previous procedure and of the VARCLUS Procedure are
evaluated statistically.
Section 2: Methodology - Modified Factor Analysis

A.  Comparison  of  Methods  Used 

FACTOR  ANALYSIS 

1.  The  original  data  is  reduced  in  the  number  of  variables  which 
will  be  used  for  final  analysis. 

a.  The  maximum  n  is  determined,  where  n  is  the  number  of 
times  a  variable  is  present. 

b. All of the variables are compared to one another to
yield a matrix composed of the simultaneous occurrences
of each paired set of variables.

c. A Jaccard Coefficient Matrix is formed, and those pairs
which have large Jaccard values are either combined to form
one variable or the variable may be discarded.

2.  Manipulate  the  original  data  before  putting  it  into  the  SAS 
program. 

a.  Correlation  Matrix 

b.  Covariance  Matrix 

Note: This is where USASC placed the Jaccard Similarity
Matrix (discussed in Section 1).

3.  Once  the  data  is  in  the  proper  form  the  SAS  FACTOR  procedure 
is  implemented. 

a.  The  SAS  Program  used 

/* Read the correlation-type matrix (here, the Jaccard matrix). */
DATA CORREL (TYPE = CORR);
_TYPE_ = 'CORR';
INFILE SASONE;
INPUT _NAME_ $ FAT OBS ILL U60 UH1 A64 ADM PRO TAC AID
      UNA PA1 PA2 PA3 PA4 PA5 PA6 PA7 PA8 TE8
      T10 T11 PMS LND CRU HOV FIV EXP;
PROC PRINT;

/* Extract eight factors by the principal method with varimax */
/* rotation and save the scoring coefficients in FACT.        */
PROC FACTOR METHOD = PRINCIPAL N=8 ROTATE = VARIMAX MSA
     OUTSTAT = FACT DATA = CORREL SCORE;

/* Read the raw binary case data. */
DATA RAW;
INFILE SASTWO;
INPUT _NAME_ SEQ FAT OBS ILL U60 UH1 A64 CH47 H6 ADM
      PRO TAC AID UNA PA1 PA4 PA8 PA6 PA5
      PA2 PA3 PA7 PCT PNC TE8 T10 T11 PMS
      LND CRU HOV FIV EXP;
PROC PRINT;

/* Score each case on every factor. */
PROC SCORE DATA = RAW SCORE = FACT OUT = SCORES;
PROC PRINT DATA = SCORES;
RUN;

b.  The  number  of  factors  used  can  then  be  chosen  in  the 
following  manners: 

i.  variance  explained 

ii.  when  specifically  chosen  variables  are  contained 
in  separate  factors 

iii.  when  factor  pattern  coefficients  fit  a 
predetermined  condition 

iv.  knowledge  of  data 

v.  inter-factor  correlations  are  at  a  minimum. 

c.  Once  a  number  of  factors  are  chosen,  the  Factor  Pattern 
Matrix  is  multiplied  with  the  Data  Matrix.  The  resultant 
matrix  is  the  Scoring  Matrix. 

d.  When  the  Scoring  Matrix  is  obtained  it  consists  of  a 
score  for  each  case  in  relation  to  every  factor. 

e. Cases are then separated into factors according to the
factor on which each case had the highest score.

4. After the cases are assigned to their respective factors,
the cases can then be analyzed to determine if there is any
sort of pattern within a factor.

The  VARCLUS  procedure  has  many  options  which  can  be 
programmed  depending  upon  the  desired  results  the  researcher  is 
hoping  to  obtain.  The  procedure  used  for  this  report  did  not  use 
any  special  options.  The  procedure  was  allowed  to  run  until  the 
clusters  contained  only  one  or  two  variables.  The  researcher  can 
specify  the  maximum  or  minimum  number  of  clusters  desired. 
Clusters  can  also  be  separated  based  on  several  different  methods. 
The  method  used  here  was  the  centroid  method  because  it  allowed  for 
no interaction between the clusters. All of the options available
are listed in the SAS User's Guide: Statistics, Version 5 Edition,
Chapter 40, page 801.
VARCLUS


1.  The  original  data  is  reduced  in  the  number  of  variables  which 
will  be  used  for  final  analysis. 

a.  The  maximum  n  is  determined,  where  n  is  the  number  of 
times  a  variable  is  present. 

b.  Then  all  of  the  variables  are  compared  to  one  another  to 
yield  a  matrix  composed  of  the  simultaneous  occurrences 
of  each  paired  set  of  variables. 

c. Then a Jaccard Coefficient Matrix is formed, and those
pairs which have large Jaccard values are either combined to
form one variable or the variable may be discarded.

2. Once the data is reduced, the SAS VARCLUS procedure is
implemented.

a.  SAS  Program  used 

DATA SAFETY;
SET NGT.EVT;
RUN;

PROC VARCLUS DATA=SAFETY SUMMARY OUTSTAT=CLSTR;
RUN;

PROC TREE;
RUN;

b. The number of clusters used can then be chosen in the
following manners:

i.  variance  explained 

ii.  when  specifically  chosen  variables  are  contained 
in  separate  clusters 

iii.  when  cluster  structure  coefficients  fit  a 
predetermined  condition 

iv.  knowledge  of  data 

v.  inter-cluster  correlations  are  at  a  minimum. 

c. Once a number of clusters is chosen, the Cluster
Structure Matrix is multiplied with the Data Matrix. The
resultant matrix is the Scoring Matrix (refer to the Sorting
Criterion below).

d. Once a Scoring Matrix is obtained it consists of a score
for each case in relation to every cluster.

e. Cases are then separated into clusters according to the
cluster on which each case had the highest score.

3. After the cases are assigned to their respective clusters,
the variables can then be analyzed to determine if there is
any sort of pattern within a cluster.
Sorting Criterion


The Sorting Procedure is composed of a series of matrix
manipulations. By definition:


The Cluster Structure Matrix is made up of the linear combinations
of the variables contained in the cluster being examined
(Cluster Coefficients x Variables).

The Data Matrix is composed of dichotomous data identifying
the variables occurring in each case (Variables x Cases).

The Scoring Matrix is created by multiplying the Cluster
Structure Matrix by the Data Matrix. The resultant matrix has
one row per cluster and one column per case (Cluster
Coefficients x Cases). Each entry is the sum of the cluster
coefficients of the variables identified for that particular
case.


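Example:

\[
\begin{bmatrix}
 0.43 &  0.25 &  0.65 \\
 0.79 & -0.32 &  0.09 \\
-0.72 &  0.89 & -0.45 \\
 0.65 &  0.02 & -0.89
\end{bmatrix}
\times
\begin{bmatrix}
1 & 0 & 1 \\
0 & 1 & 1 \\
0 & 1 & 1
\end{bmatrix}
=
\begin{bmatrix}
 0.43 &  0.90 &  1.33 \\
 0.79 & -0.23 &  0.56 \\
-0.72 &  0.44 & -0.28 \\
 0.65 & -0.87 & -0.22
\end{bmatrix}
\]
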


The Scoring Matrix in this study was created with LOTUS software
by selecting the defined Cluster Structure Matrix and the
defined Data Matrix and multiplying the two matrices. It should be
noted that a SCORE (Sorting) Procedure is available in SAS (SAS
User's Guide: Statistics, Version 5 Edition, Chapter 34, page 735).
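
The same multiplication, followed by the assignment of each case to its highest-scoring cluster, can be sketched in Python as a check on the spreadsheet arithmetic. The coefficients are those of the example above; reading them as a 4 x 3 Cluster Structure Matrix (four clusters, three variables) and a 3 x 3 Data Matrix (three variables, three cases) is an assumption about the printed layout, and the helper name matmul is illustrative only.

```python
def matmul(a, b):
    """Plain-Python product of two matrices stored as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)]
            for row in a]

# Cluster Structure Matrix: one row per cluster, one column per variable.
structure = [
    [0.43, 0.25, 0.65],
    [0.79, -0.32, 0.09],
    [-0.72, 0.89, -0.45],
    [0.65, 0.02, -0.89],
]

# Data Matrix: one row per variable, one column per case (1 = present).
data = [
    [1, 0, 1],
    [0, 1, 1],
    [0, 1, 1],
]

# Scoring Matrix: one row per cluster, one column per case.
scores = matmul(structure, data)

# Each case goes to the cluster (numbered from 1) with the highest score.
n_cases = len(data[0])
assigned = [
    max(range(len(scores)), key=lambda k: scores[k][j]) + 1
    for j in range(n_cases)
]

for row in scores:
    print([round(v, 2) for v in row])
print(assigned)
```

Under these assumptions the first case is assigned to cluster 2 and the other two cases to cluster 1.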
B.  Comparison of Procedures

One  noted  difference  between  the  VARCLUS  procedure  and  the 
factor  analysis  is  the  treatment  of  the  original  data.  In  factor 
analysis  the  first  step  is  to  form  a  correlation  matrix  which 
compares  variable  to  variable  until  every  combination  has  been 
made.  In  the  VARCLUS  procedure  the  first  step  is  to  plot  each  case 
as  a  point  on  a  multidimensional  hyper-space.  Each  variable  is 
treated  as  though  it  were  one  plane  in  the  hyper-space. 

The  advantage  of  the  VARCLUS  procedure  is  that  it  does  not 
compare  the  variables  themselves.  Instead  this  procedure  compares 
the  cases  and  then  determines  which  variables  are  driving  those 
cases.  When  the  data  is  compared  case  to  case,  it  is  now  truly 
dichotomous  data. 

A second noted difference is the use of the correlation
coefficient. As indicated in Section 1 of this report, factor
analysis uses the correlation coefficient matrix as the foundation
for all of the calculations, which was shown to be mathematically
incorrect due to the lack of dichotomy in the data. In the VARCLUS
procedure the correlation coefficient is used to separate variables
into clusters; however, it is calculated between the centroid of a
cluster and the variable in question. If a variable has a
relatively small coefficient when compared to the cluster it is in,
that variable then becomes part of a new cluster. This eliminates
the original problem of dichotomous data because, when the cases are
compared, the variables are then truly dichotomous.
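
A small Python sketch may help fix the idea of correlating a variable with the centroid of its cluster. It is not the SAS implementation: the centroid is taken here as the unweighted mean of the raw 0/1 variables, the variable names V1 to V3 and the six cases are hypothetical, and no splitting threshold is implied.

```python
from math import sqrt

def pearson(x, y):
    """Pearson product-moment correlation of two equal-length lists."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)

def centroid(columns):
    """Case-by-case mean of the variables in a cluster (assumed definition)."""
    return [sum(vals) / len(vals) for vals in zip(*columns)]

# Hypothetical cluster of three dichotomous variables on six cases.
cluster = {
    "V1": [1, 1, 0, 0, 1, 0],
    "V2": [1, 1, 0, 0, 0, 0],
    "V3": [0, 0, 1, 1, 0, 1],
}
center = centroid(list(cluster.values()))
r = {name: pearson(values, center) for name, values in cluster.items()}
for name, value in r.items():
    print(name, round(value, 4))
```

In this made-up cluster V3 correlates negatively with the centroid, so it is the variable that would be expected to move to a new cluster.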

The third difference involves the Factor Pattern Matrix and
the Cluster Structure Matrix. Both matrices are a linear
representation of each variable with respect to the individual
factors or clusters. However, the factor pattern entries for each
factor can be squared and summed over all of the variables, which
yields the amount of variance explained by that factor. This can
also be done with the Cluster Structure Matrix, except that only
those variables which are present in the cluster being examined
are squared and summed to determine the amount of variance
explained. The following eight (8) tables contain the linear
equations for each cluster and each factor, which are obtained
from the Cluster Structure Matrix and the Factor Pattern Matrix.
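
The difference between the two variance-explained calculations can be illustrated with a short Python sketch. The loadings and variable names below are hypothetical and are not taken from the tables; the function name is illustrative only.

```python
def variance_explained(loadings, members=None):
    """Sum of squared loadings.

    With members=None every variable is used (the factor case); otherwise
    only the variables belonging to the cluster are used (the cluster case).
    """
    names = loadings if members is None else members
    return sum(loadings[v] ** 2 for v in names)

# Hypothetical column of loadings for four variables.
loadings = {"V1": 0.80, "V2": 0.80, "V3": -0.10, "V4": 0.05}

# Factor: square and sum over all variables.
factor_ve = variance_explained(loadings)

# Cluster: square and sum only over the variables in the cluster.
cluster_ve = variance_explained(loadings, members=["V1", "V2"])

print(round(factor_ve, 4), round(cluster_ve, 4))
```

The cluster figure can never exceed the factor figure computed from the same column, because it omits the squared loadings of the variables outside the cluster.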
Table 1
VARIABLE


TE11 


CLUSTER  2 


CLUSTER  2 
SQUARED 


FACTOR  3 


FACTOR  3 
SQUARED 


0.8534 


-0.0853 


-0.2907 


-0.1257 


-0.1257 


0.9330 


-0.1366 


-0.1754 


-0.1200 


-0.1887 


-0.1801 


-0.0796 


0.9330 


0.8704 


0.8704 


-0.0340 


0.8488 


-0.0409 


-0.1036 


-0.0025 


-0.0448 


-0.0094 


-0.0026 


0.8630 


0.0721 


0.0370 


0.0033 


0.0012 


0.7204 


0.0107 


0.0000 


0.0020 


0.0001 


0.0000 


0.7448 


0.0052 


0.0014 


0.0005 


0.0004 


0.0000 


VARIABLE 


TE10 


VARIANCE 

EXPLAINED 


CLUSTER  3 


-0.0152 


0.0213 


-0.0726 


-0.1001 


-0.1262 


-0.0575 


-0.0348 


0.7512 


-0.0270 


-0.1241 


0.8200 


-0.0816 


-0.0687 


0.0766 


-0.0237 


-0.0615 


0.6521 


-0.1152 


CLUSTER  3 
SQUARED 


0.5643 


0.6725 


FACTOR  6 


0.4252 


-0.0256 


0.0701 


0.0229 


-0.0327 


0.0444 


-0.0136 


-0.0882 


0.1582 


0.0593 


0.0162 


0.1270 


-0.0200 


-0.0056 


0.0020 


-0.0020 


0.0078 


-0.0049 


0.7822 


0.0165 


0.0040 


0.7785 


-0.0041 


0.0115 


0.1122 


0.0566 


0.0405 


0.3813 


-0.0128 


FACTOR  6 
SQUARED 


0.0007 


0.0049 


0.0005 


0.0011 


0.0020 


0.0002 


0.0078 


0.0250 


0.0035 


0.0003 


0.0161 


0.0000 


0.0000 


0.0000 


0.0001 


0.0000 


0.6118 


0.0003 


0.0000 


0.6061 


0.0000 


0.0001 


0.0126 


0.0032 


0.0016 


0.1454 


0.0002 


1.4438 
Table 4
VARIABLE


AH64 


AD 


TE10 


VARIANCE 

EXPLAINED 


CLUSTER  5 


-0.2524 


-0.1021 


0.4527 


0.4936 


-0.4728 


0.0307 


-0.3883 


-0.2783 


0.5325 


0.5947 


-0.5947 


-0.0536 


0.0005 


0.4213 


0.0597 


-0.0786


0.0398 


0.0116 


-0.1781 


0.0371 


0.0925 


-0.1409 


0.4028 


CLUSTER  5 
SQUARED 


0.2050 


0.2835 


0.1623 


1.4255 


FACTOR  1 

FACTOR  1 
SQUARED 

-0.0113 

0.0001 

0.4614 

0.2128 

0.7206 

0.5193 

0.5367 

0.2881 

0.0494 

0.0024 

0.0211 

0.0004 

-0.0114 

0.0001 

0.2770 

0.0767 

0.7099 

0.5040 

0.7877 

0.6205 

0.0998 

0.0100 

0.4363 

0.1903 

0.0538 

0.0029 

0.1785 

0.0319 

0.1770 

0.0313 

0.0385 

0.0015 

0.1700 

0.0289 

0.0023 

0.0000 

0.1208 

0.0146 

0.3700 

0.1369 

-0.0343 

0.0012 

0.0948 


0.0525 


0.1953 


0.5011 


0.1177 


0.2164 


0.1516 


0.0090 


0.0028 


0.0381 


0.2511 


0.0139 


0.0468 


0.0230 


3.0587 
Table 6


VARIABLE 


PA4 


PA5 


PA6 


PA7 


PA8 


E8 


TE10 


TE11 


PMS 


LND 


CRU 


HOV 


FIV 


EXP 


CLUSTER  6 


0.4851 


-0.1864 


-0.2071 


0.6240 


-0.2013 


0.6792 


-0.1188 


-0.4075 


-0.6550 


0.6550 


-0.1140 


-0.1177 


-0.1330 


-0.1580 


0.0946 


-0.0260 


-0.0501 


0.5015 


0.2949 


-0.0319 


-0.0448 


0.0964 


0.2211 


-0.1178 


-0.1907 


0.0708 


-0.0101 


CLUSTER  6 
SQUARED 


0.2354 


FACTOR  2 


FACTOR  2 
SQUARED 


0.3894 


0.4614 


0.3531 

0.1247 

0.1753 

0.0307 

0.0071 

0.0000 

0.6645 

0.4416 

0.0214 

0.0005 

0.6799 

0.4622 

0.2474 

0.0612 

-0.0581 

0.0034 

-0.0172 

0.0003 

0.6691 

0.4476 

0.3055 

0.0933 

0.0896 

0.0080 

-0.0420 

0.0018 

0.0358 

0.0013 

0.1237 

0.0153 

0.1373 

0.0189 

0.0235 

0.0005 

0.2306 

0.0532 

0.2517 

0.0634 

-0.0577 

0.0033 

0.0715 

0.0051 

0.1182 

0.0140 

0.4232 

0.1791 

0.0623 

0.0039 

-0.0122 

0.0001 

0.2383 

0.0568 

0.0665 

0.0044 

Table  7 
Table 8


VARIABLE 


UH60 


8 


TE8 


TE10 


TE11 


PMS 


Variance 

Explained 


CLUSTER  8 


0.0168 


-0.0696 


-0.0962 


-0.0794 


0.0768 


-0.0086 


0.0710 


-0.0703 


0.0146 


-0.0905 


0.0905 


-0.1702 


0.8044 


-0.0786 


-0.1581 


-0.1531 


-0.0306 


-0.0047 


0.0354 


-0.1350 


0.0055 


-0.0853 


-0.1646 


0.8044 


-0.2623 


-0.1771 


0.0049 


0.0104 


CLUSTER  8 
SQUARED 


0.6471 


0.6471 


FACTOR  8 

FACTOR  8 
SQUARED 

-0.0670 

0.0045 

0.0194 

0.0004 

0.0571 

0.0033 

0.0279 

0.0008 

-0.0349 

0.0012 

0.0389 

0.0015 

0.0592 

0.0035 

-0.2383 

0.0568 

0.1997 

0.0399 

0.0711 

0.0051 

0.0540 

0.0029 

-0.0236 

0.0006 

0.6120 

0.3746 

0.2479 

0.0614 

-0.0445 

0.0020 

-0.0520 

0.0027 

-0.4366 

0.1906 

0.0013 

0.0000 

0.1138 

0.0130 

-0.2175 

0.0473 

0.0680 

0.0046 

0.0136 

0.0002 

0.0397 

0.0016 

0.3948 

0.1559 

-0.2975 

0.0885 

-0.0343 

0.0012 

-0.1096 

0.0120 

0.0858 


1.1617 


The Factor Pattern Matrix and the Cluster Structure Matrix can
both be used to assign cases, which are aircraft accidents in our
investigation, to specific factors or clusters. If the rotated
Factor Pattern Matrix is multiplied with the original data matrix,
the result is a score for each case on each factor, and cases are
assigned to the factors based on these scores. Once the cases have
been assigned to specific factors, the groups of cases can then be
further analyzed to determine any trends the variables may show.
This same procedure can be accomplished using the Cluster
Structure Matrix.

The final difference involves the transformation used in each
of the procedures. In the VARCLUS procedure, after a group of
clusters is formed, an oblique transformation is used to calculate
the distance of the variable within the particular cluster. In the
factor analysis, a VARIMAX rotation is done after the initial
oblique transformation is accomplished, to maximize the variance of
the squared loadings for each factor.
C.  Comparison of Output

The first section of the SAS output from the VARCLUS procedure
contains the breakdown of variables into clusters. Each cluster
contains the group of variables whose R-squared with that cluster
is higher than with any other cluster. Another output of the
VARCLUS procedure is the Cluster Structure Matrix explained in the
previous section.

The steps of this procedure are repeated until each cluster
contains only one or two variables, unless otherwise specified.
(In the example that follows, the night study data from the USASC
was used and compared to the factor analysis which was performed on
the same data.) In the VARCLUS procedure eight (8) clusters were
used because at that point the variables representing problem areas
were split into separate clusters. The following is a list of
which variables were contained in which cluster:


Table 9.

CLUSTER      VARIABLES
Cluster 1    PA5   PMS
Cluster 2    PA4   TE11
Cluster 3    PA7   TE10  FIV
Cluster 4    AH64  PA1   PRO   HOV
Cluster 5    ILL   TAC   UH60  ATP   PA3   EXP
Cluster 6    FAT   OBS   UH1   UNA   PAS   ADM
Cluster 7    PA6   TE8   CRU
Cluster 8    PA2


The next step is to take the original data matrix and multiply
it by the cluster structure matrix. The result of this
multiplication is a scoring matrix which determines the cluster to
which each case belongs. Once the cases are put into their
respective clusters, those cases are further analyzed.

To determine if the clusters were an accurate description of
the factors which have been previously used, a comparison between
VARCLUS clusters and factor analysis factors was made. A cluster
was matched to a factor based on variance explained and variables
contained in each cluster. An example is the match between Cluster
1 and Factor 5 (refer to Table 1). In Cluster 1, the variables PA5
and PMS contributed to the variance explained. In Factor 5, these
same two variables contributed the most to the variance explained.


The following tables represent the distribution of variables
from the cases in those clusters and factors. A test of
proportions was used to determine if the proportions in the
clusters were equivalent to the proportions in the factors. The
test statistic for this test was the z statistic.


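\[
Z = \frac{\dfrac{x_1}{n_1} - \dfrac{x_2}{n_2}}
         {\sqrt{\hat{p}\,(1 - \hat{p})\left(\dfrac{1}{n_1} + \dfrac{1}{n_2}\right)}}
\]

where

\[
\hat{p} = \frac{x_1 + x_2}{n_1 + n_2}
\]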


The tests are based upon a 95% confidence level, which gives
critical z values of ±1.96. Therefore any z value which is
larger than 1.96 or less than -1.96 does not support the hypothesis
that the proportions are equal. The shaded values in the following
tables are those values which do not support the hypothesis.
(Statistics: A First Course, Freund and Smith, page 368.)


\[
H_0 : p_1 = p_2
\]

\[
H_a : p_1 \neq p_2
\]
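
As a check on the arithmetic, the test of proportions can be sketched in Python. The counts used are those of the OBS row for Cluster 2 and Factor 3 (15 of 24 cases and 10 of 18 cases); the function name is illustrative, and returning z = 0 when the pooled proportion is exactly 0 or 1 is an assumption made here to avoid dividing by zero.

```python
from math import sqrt

def two_proportion_z(x1, n1, x2, n2):
    """Pooled two-sample z statistic for H0: p1 = p2."""
    p_hat = (x1 + x2) / (n1 + n2)
    se = sqrt(p_hat * (1 - p_hat) * (1 / n1 + 1 / n2))
    if se == 0:
        # Assumption: identical 0% or 100% proportions are scored as z = 0.
        return p_hat, 0.0
    return p_hat, (x1 / n1 - x2 / n2) / se

# OBS row: 15 of 24 cluster cases versus 10 of 18 factor cases.
p_hat, z = two_proportion_z(15, 24, 10, 18)

# Two-sided test at the 95% level: reject H0 when |z| exceeds 1.96.
reject = abs(z) > 1.96

print(round(p_hat, 2), round(z, 2), reject)
```

With these counts the statistic falls well inside ±1.96, so the hypothesis of equal proportions is not rejected for this variable.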
Table 10.

          CLUSTER 1    FACTOR 5
VAR           %            %        P-HAT    Z-VALUE
OBS         42.9%        50.0%      0.47     -0.27
ILL        100.0%        87.5%      0.93      0.97
OIL         42.9%        37.5%      0.40      0.21
UH60        14.3%        25.0%      0.20     -0.52
TAC         57.1%        50.0%      0.53      0.27
UNA         57.1%        50.0%      0.53      0.27
PA5         87.5%        87.5%      0.88      0.00
PCT         71.4%        87.5%      0.80     -0.78
TE8         87.5%        75.0%      0.81      0.61
CASES        3.8%         4.3%

Table 11.

            CLUSTER 2         FACTOR 3
VAR          %       #         %       #      P-HAT    Z-VALUE
OBS        62.5%    15       55.6%    10      0.60      0.45
ILL        75.0%    18       66.7%    12      0.71      0.59
OIL        45.8%    11       33.3%     6      0.40      0.82
UH60       41.7%    10       33.3%     6      0.38      0.55
PRO        50.0%    12       50.0%     9      0.50      0.00
AID        79.2%    19       72.2%    13      0.76      0.53
PA4       100.0%    24      100.0%    18      1.00      0.00
PCT        70.8%    17       77.8%    14      0.74     -0.51
TE11       75.0%    18      100.0%    18      0.86     [shaded value illegible]
CASES      13.0%    24        9.8%    18

Table  12 


Table  13. 
Table 14.

            CLUSTER 5         FACTOR 1
VAR          %       #         %       #      P-HAT    Z-VALUE
OBS        36.5%    23       48.2%    40      0.43     -1.41
ILL        87.3%    55       89.2%    74      0.88     -0.35
OIL        34.9%    22       45.8%    38      0.41     -1.33
UH60       44.4%    28       43.4%    36      0.44      0.12
TAC        88.9%    56       71.1%    59      0.79      2.51
AID        95.2%    60       90.4%    75      0.92      1.09
PA1        38.1%    24       42.2%    35      0.40     -0.50
PCT        77.8%    49       72.3%    60      0.75      0.76
TE8        30.2%    19       39.8%    33      0.36     -1.20
CRU        28.6%    18       42.2%    35      0.36     -1.69
CASES      34.2%    63       45.1%    83


Table  15 


Table  16. 
Table 17.

            CLUSTER 8         FACTOR 8
VAR          %       #         %       #      P-HAT    Z-VALUE
ILL        62.5%     5       66.7%     8      0.65     -0.19
UH1        25.0%     2        8.3%     1      0.15      1.03
TAC        75.0%     6       75.0%     9      0.75      0.00
AID        62.5%     5       66.7%     8      0.65     -0.19
PA2        87.5%     7       83.3%    10      0.85      0.26
PCT        87.5%     7       91.7%    11      0.90     -0.31
LND       100.0%     8       75.0%     9      0.85      1.53
CASES       4.3%     8        6.5%    12


According to the tables, the only cluster-to-factor pairing
which may not be considered a good match is Cluster 4 to Factor
4 (Table 13). This would indicate that PROC VARCLUS was able
to reproduce the same results as the factor analysis procedure.
However, the mathematical difficulties have been overcome with the
use of the VARCLUS procedure.
II.  Conclusion

Many studies conducted by the U.S. Army Safety Center involve
the measurement of a large number of variables on different cases.
Factor Analysis is one procedure that is very useful in those
situations in which one wants to reduce the number of variables
under consideration while at the same time retaining as much
subject-to-subject variability as possible. In most of the
literature concerning Factor Analysis, it is suggested that binary
data not be used, and it is strongly recommended that data of this
type not be used when manipulating large data sets. This is due to
the fact that the correlation matrix is based on the Euclidean
distance of the vectors in the plane and/or the covariance matrix,
which is based on the variance of the data being examined. Hence,
the use of binary (dichotomous) data is not recommended.

Cluster Analysis, however, is the grouping of similar
variables using data from the cases. It is part of the general
scientific process of searching for patterns in data and then
trying to construct laws that explain the patterns. Clustering can
compare case-to-case situations and would be amenable to
dichotomous data. The data will form an obliquely transformed data
set within the cluster.

The VARCLUS procedure has many options which can be
programmed depending upon the results the researcher is
hoping to obtain. The procedure used for this report did not use
any special options. The procedure was allowed to run until the
clusters contained only one or two variables. The researcher can
specify the maximum or minimum number of clusters desired.
Clusters can also be separated based on several different methods.
The method used here was the centroid method because it allowed for
no interaction between variables. All of the options available are
listed in the SAS User's Guide: Statistics, Chapter 40, page 801.
III.  Recommendations

To reiterate, the analysis performed in this report was done
using a Clustering Technique (Centroid Method) to duplicate the
Factor Analysis Procedure established by the USASC. Basically,
while performing this analysis, we found ourselves working towards
answers already achieved. This technique allowed a thorough
evaluation of the mathematical procedures used by the USASC. Listed
below are some recommendations for further analysis:

1.)  Re-analyze the Crew Co-ordination Study, including
     Operation Desert Storm/Desert Shield data, using
     Clustering Techniques.

2.)  Re-analyze the Night Study, including Operation Desert
     Storm/Desert Shield data, using Clustering Techniques.

3.)  Investigate the various options given with the VARCLUS
     Procedure.

     a.  INITIAL = SEED assigns to clusters the variables named
         in the SEED statement. The other variables are not
         specifically assigned until the clustering technique
         is performed.

     b.  HIERARCHY requires the clusters at different levels
         to maintain a hierarchical structure.

4.)  Investigate strategies to establish a criterion for
     recognizing a demand characteristic for clustering.

5.)  Investigate different methods to calculate the
     correlation/covariance matrix with respect to
     dichotomous data used in the Safety Center Studies.